Entity Recognizer


Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation

Ai, Chaoyi, Jiang, Yong, Huang, Shen, Xie, Pengjun, Tu, Kewei

arXiv.org Artificial Intelligence

Named entity recognition (NER) models often struggle with noisy inputs, such as those with spelling mistakes or errors generated by Optical Character Recognition processes, and learning a robust NER model is challenging. Existing robust NER models utilize both noisy text and its corresponding gold text for training, which is infeasible in many real-world applications in which gold text is not available. In this paper, we consider a more realistic setting in which only noisy text and its NER labels are available. We propose to retrieve relevant text of the noisy text from a knowledge corpus and use it to enhance the representation of the original noisy input. We design three retrieval methods: sparse retrieval based on lexicon similarity, dense retrieval based on semantic similarity, and self-retrieval based on task-specific text. After retrieving relevant text, we concatenate the retrieved text with the original noisy text and encode them with a transformer network, utilizing self-attention to enhance the contextual token representations of the noisy text using the retrieved text. We further employ a multi-view training framework that improves robust NER without retrieving text during inference. Experiments show that our retrieval-augmented model achieves significant improvements in various noisy NER settings.
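The retrieve-then-concatenate step the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the Jaccard scoring used for sparse lexical retrieval, and the `[SEP]` separator are illustrative assumptions.

```python
# Minimal sketch of sparse (lexical) retrieval followed by concatenation.
# Not the paper's implementation: the toy corpus, Jaccard scoring, and
# "[SEP]" separator are illustrative assumptions.

def jaccard(a, b):
    """Lexical similarity between two token lists, as set overlap."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(noisy_text, corpus, k=1):
    """Return the k corpus sentences most lexically similar to the input."""
    q = noisy_text.lower().split()
    scored = sorted(corpus, key=lambda s: jaccard(q, s.lower().split()),
                    reverse=True)
    return scored[:k]

def augment(noisy_text, corpus):
    """Concatenate retrieved text after the noisy input; a transformer
    encoder's self-attention can then attend across both segments."""
    retrieved = retrieve(noisy_text, corpus)
    return noisy_text + " [SEP] " + " ".join(retrieved)

corpus = [
    "Barack Obama visited Berlin in 2013 .",
    "The Amazon river flows through Brazil .",
]
print(augment("Barack Obma visited Berln", corpus))
```

Even with misspelled entity mentions, the surviving tokens ("Barack", "visited") are enough for lexical retrieval to surface a clean sentence containing the correct spellings.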


Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Chu, Zhendong, Zhang, Ruiyi, Yu, Tong, Jain, Rajiv, Morariu, Vlad I, Gu, Jiuxiang, Nenkova, Ani

arXiv.org Artificial Intelligence

To achieve state-of-the-art performance, one still needs to train NER models on large-scale, high-quality annotated data, an asset that is both costly and time-intensive to accumulate. In contrast, real-world applications often resort to massive low-quality labeled data through non-expert annotators via crowdsourcing and external knowledge bases via distant supervision as a cost-effective alternative. However, these annotation methods result in noisy labels, which in turn lead to a notable decline in performance. Hence, we propose to denoise the noisy NER data with guidance from a small set of clean instances. Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights. The discriminator is capable of detecting both span and category errors with different discriminative prompts. Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
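The sample-reweighting idea can be sketched in a few lines. This is a toy illustration, not the paper's discriminator: the discriminator logits and the sigmoid weighting scheme are assumptions.

```python
import math

# Toy sketch of loss reweighting guided by a discriminator.
# Not the paper's model: the discriminator logits and the sigmoid
# weighting scheme here are illustrative assumptions.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reweighted_loss(per_sample_losses, disc_logits):
    """Down-weight samples the discriminator flags as likely mislabeled.

    disc_logits > 0 means 'probably clean'; each weight is the
    discriminator's probability that the sample's label is correct."""
    weights = [sigmoid(z) for z in disc_logits]
    total_w = sum(weights)
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total_w

losses = [0.2, 3.5, 0.4]   # hypothetical per-sample NER losses
logits = [4.0, -4.0, 3.0]  # discriminator flags sample 2 as noisy
print(reweighted_loss(losses, logits))
```

The high-loss sample that the discriminator flags as noisy contributes almost nothing, so the effective training signal comes from the samples judged clean.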


How to Train spaCy to Autodetect New Entities (NER) [Complete Guide]

#artificialintelligence

Named-entity recognition (NER) is the process of automatically identifying the entities discussed in a text and classifying them into pre-defined categories such as 'person', 'organization', 'location' and so on. The spaCy library allows you to train NER models both by updating an existing spaCy model to suit the specific context of your text documents and by training a fresh NER model from scratch. This article explains both methods clearly and in detail. spaCy is widely used because of its flexible and advanced features. Before diving into how NER is implemented in spaCy, let's quickly understand what a named entity recognizer is.
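A minimal from-scratch training loop along the lines the article describes might look as follows. This is a sketch assuming spaCy v3; the "GADGET" label and the two training sentences are invented for illustration, and real training needs far more data.

```python
import spacy
from spacy.training import Example

# Sketch of training a fresh spaCy v3 NER model from scratch.
# The "GADGET" label and the training sentences are invented for
# illustration; real training needs far more annotated data.
nlp = spacy.blank("en")          # fresh pipeline, no pretrained model
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

TRAIN_DATA = [
    ("I bought a Pixel 8 yesterday", {"entities": [(11, 18, "GADGET")]}),
    ("The Pixel 8 has a great camera", {"entities": [(4, 11, "GADGET")]}),
]

optimizer = nlp.initialize()
for _ in range(20):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

print([(ent.text, ent.label_) for ent in nlp("I want a Pixel 8").ents])
```

Updating an existing model instead starts from `spacy.load(...)` and `nlp.resume_training()` rather than a blank pipeline, so the pretrained weights are preserved.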


Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy

Partalidou, Eleni, Spyromitros-Xioufis, Eleftherios, Doropoulos, Stavros, Vologiannidis, Stavros, Diamantaras, Konstantinos I.

arXiv.org Machine Learning

This paper proposes a machine learning approach to part-of-speech tagging and named entity recognition for Greek, focusing on the extraction of morphological features and classification of tokens into a small set of classes for named entities. The architecture model that was used is introduced. The Greek version of the spaCy platform was added to the source code, a feature that did not exist before our contribution, and was used for building the models. Additionally, a part-of-speech tagger was trained that can detect the morphology of the tokens and performs better than state-of-the-art results when classifying only the part of speech. For named entity recognition using spaCy, a model that extends the standard ENAMEX types (organization, location, person) was built. The experiments that were conducted indicate the need for flexibility in handling out-of-vocabulary words, and an effort to resolve this issue is underway. Finally, the evaluation results are discussed.


A multi-representational convolutional neural network architecture for text classification

#artificialintelligence

Over the past decade or so, convolutional neural networks (CNNs) have proven to be very effective in tackling a variety of tasks, including natural language processing (NLP) tasks. NLP entails the use of computational techniques to analyze or synthesize language, both in written and spoken form. Researchers have successfully applied CNNs to several NLP tasks, including semantic parsing, search query retrieval and text classification. Typically, CNNs trained for text classification tasks process sentences on the word level, representing individual words as vectors. Although this approach might appear consistent with how humans process language, recent studies have shown that CNNs that process sentences on the character level can also achieve remarkable results.
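Character-level processing replaces the word-vector lookup with a fixed-alphabet encoding; a minimal sketch follows, in which the alphabet and the maximum sequence length are arbitrary choices, not values from any particular paper.

```python
# Sketch of the character-level input encoding used by char-CNNs: each
# character becomes a one-hot vector over a fixed alphabet, so a sentence
# becomes a (max_len x alphabet_size) matrix fed to 1-D convolutions.
# The alphabet and max_len here are arbitrary choices.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,;'"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_chars(text, max_len=32):
    matrix = []
    for ch in text.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        if ch in INDEX:              # out-of-alphabet chars stay all-zero
            row[INDEX[ch]] = 1
        matrix.append(row)
    matrix += [[0] * len(ALPHABET)] * (max_len - len(matrix))  # pad
    return matrix

m = one_hot_chars("Hello, CNN!")
print(len(m), len(m[0]))   # 32 41
```

Because the alphabet is tiny compared to a word vocabulary, the model is naturally robust to misspellings and rare words: "Pixl" and "Pixel" share most of their character rows even though a word-level model would treat them as unrelated tokens.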


The Stanford Natural Language Processing Group

@machinelearnbot

The original CRF code is by Jenny Finkel. The feature extractors are by Dan Klein, Christopher Manning, and Jenny Finkel. Much of the documentation and usability is due to Anna Rafferty. More recent code development has been done by various Stanford NLP Group members. Stanford NER is available for download, licensed under the GNU General Public License (v2 or later).


Effective Bilingual Constraints for Semi-Supervised Learning of Named Entity Recognizers

Wang, Mengqiu (Stanford University) | Che, Wanxiang (Harbin Institute of Technology) | Manning, Christopher D. (Stanford University)

AAAI Conferences

Most semi-supervised methods in Natural Language Processing capitalize on unannotated resources in a single language; however, information can be gained from using parallel resources in more than one language, since translations of the same utterance in different languages can help to disambiguate each other. We demonstrate a method that makes effective use of vast amounts of bilingual text (a.k.a. bitext) to improve monolingual systems. We propose a factored probabilistic sequence model that encourages both cross-language and intra-document consistency. A simple Gibbs sampling algorithm is introduced for performing approximate inference. Experiments on English-Chinese Named Entity Recognition (NER) using the OntoNotes dataset demonstrate that our method is significantly more accurate than state-of-the-art monolingual CRF models in a bilingual test setting. Our model also improves on previous work by Burkett et al. (2010), achieving a relative error reduction of 10.8% and 4.5% in Chinese and English, respectively. Furthermore, by annotating a moderate amount of unlabeled bitext with our bilingual model, and using the tagged data for uptraining, we achieve a 9.2% error reduction in Chinese over the state-of-the-art Stanford monolingual NER system.
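The cross-language agreement idea can be caricatured with a two-variable Gibbs sampler. This is a toy model, not the paper's factored CRF: the unary scores and the coupling weight are invented, and each variable stands in for the tag of one word in an aligned English-Chinese pair.

```python
import math
import random

# Toy Gibbs sampler over two aligned tag variables (e.g. the English and
# Chinese tags of one word-aligned pair), with an agreement potential
# that rewards matching tags. Not the paper's factored CRF; the unary
# scores and coupling weight are invented for illustration.
random.seed(0)

UNARY = {"en": [0.5, 0.0], "zh": [0.0, 0.5]}  # log-scores for tags 0/1
COUPLING = 3.0                                 # reward for tag agreement

def sample_tag(side, other_tag):
    """Resample one side's tag conditioned on the other side's tag."""
    logits = [UNARY[side][k] + COUPLING * (k == other_tag) for k in (0, 1)]
    z = sum(math.exp(l) for l in logits)
    return 0 if random.random() < math.exp(logits[0]) / z else 1

en, zh, agree = 0, 0, 0
for _ in range(2000):
    en = sample_tag("en", zh)
    zh = sample_tag("zh", en)
    agree += (en == zh)
print(agree / 2000)   # high agreement under a strong coupling
```

With a strong coupling term the sampled tags agree in the large majority of sweeps, which is the mechanism by which one language's confident tag pulls the other language's ambiguous tag toward consistency.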


Ontology-Based Named Entity Recognizer for Behavioral Health

Yasavur, Ugan (Florida International University) | Amini, Reza (Florida International University) | Lisetti, Christine (Florida International University) | Rishe, Naphtali (Florida International University)

AAAI Conferences

Named-Entity Recognizers (NERs) are an important part of information extraction systems in annotation tasks. Although substantial progress has been made in recognizing domain-independent named entities (e.g. location, organization and person), there is a need to recognize named entities for domain-specific applications in order to extract relevant concepts. Due to the growing need for smart health applications to address some of the latest worldwide epidemics of behavioral issues (e.g. overeating, lack of exercise, alcohol and drug consumption), we focused on the domain of behavior change, especially lifestyle change. To the best of our knowledge, there is no named-entity recognizer designed for the lifestyle change domain to enable applications to recognize relevant concepts. We describe the design of an ontology for behavioral health, based on which we developed a NER augmented with lexical resources. Our NER automatically tags words and phrases in sentences with relevant (lifestyle) domain-specific tags (e.g. [un]healthy food, potentially-risky/healthy activity, drug, tobacco and alcoholic beverage). We discuss the evaluation that we conducted with manually collected test data. In addition, we discuss how our ontology enables systems to acquire further information about the recognized named entities by using semantic reasoners.
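A stripped-down version of lexicon-augmented domain tagging can be sketched as a greedy longest-match over a phrase dictionary. The lexicon entries and tag names here are invented stand-ins, not the authors' ontology.

```python
# Stripped-down sketch of lexicon-based domain tagging: phrases found in
# a domain lexicon are annotated with their (lifestyle) tag. The entries
# and tag names are invented, not the authors' ontology.
LEXICON = {
    "fried chicken": "unhealthy-food",
    "jogging": "healthy-activity",
    "beer": "alcoholic-beverage",
}

def tag_sentence(sentence):
    """Greedy longest-match over the lexicon, left to right."""
    tokens = sentence.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        for n in (2, 1):                  # try bigrams, then unigrams
            phrase = " ".join(tokens[i:i + n])
            if phrase in LEXICON:
                tags.append((phrase, LEXICON[phrase]))
                i += n
                break
        else:
            i += 1
    return tags

print(tag_sentence("He had fried chicken and a beer after jogging"))
```

A real system would back the lexicon with the ontology, so that a match like "beer" can be expanded by a semantic reasoner into its parent concepts (alcoholic beverage, potentially risky consumption).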


Leveraging Wikipedia Characteristics for Search and Candidate Generation in Question Answering

Chu-Carroll, Jennifer (IBM T. J. Watson Research Center) | Fan, James (IBM T. J. Watson Research Center)

AAAI Conferences

Most existing Question Answering (QA) systems adopt a type-and-generate approach to candidate generation that relies on a pre-defined domain ontology. This paper describes a type independent search and candidate generation paradigm for QA that leverages Wikipedia characteristics. This approach is particularly useful for adapting QA systems to domains where reliable answer type identification and type-based answer extraction are not available. We present a three-pronged search approach motivated by relations an answer-justifying title-oriented document may have with the question/answer pair. We further show how Wikipedia metadata such as anchor texts and redirects can be utilized to effectively extract candidate answers from search results without a type ontology. Our experimental results show that our strategies obtained high binary recall in both search and candidate generation on TREC questions, a domain that has mature answer type extraction technology, as well as on Jeopardy! questions, a domain without such technology. Our high-recall search and candidate generation approach has also led to high overall QA performance in Watson, our end-to-end system.
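The metadata-driven candidate extraction can be sketched as table lookups over redirects and anchor texts. The two tables below are tiny invented stand-ins for Wikipedia's real metadata dumps, used only to show the type-free lookup pattern.

```python
# Sketch of candidate generation from Wikipedia metadata: redirects map
# alternate names to canonical titles, and anchor texts map surface
# strings to the pages they commonly link to. Both tables here are tiny
# invented stand-ins for the real Wikipedia dumps.
REDIRECTS = {"Obama": "Barack Obama", "Big Apple": "New York City"}
ANCHORS = {
    "the president": ["Barack Obama", "George W. Bush"],
    "the city": ["New York City", "London"],
}

def candidates(surface):
    """Collect candidate answer titles for a surface string, with no
    answer-type ontology involved."""
    out = []
    if surface in REDIRECTS:
        out.append(REDIRECTS[surface])
    out.extend(ANCHORS.get(surface, []))
    seen, unique = set(), []          # de-duplicate, preserving order
    for title in out:
        if title not in seen:
            seen.add(title)
            unique.append(title)
    return unique

print(candidates("the president"))
```

Because the lookup never consults an answer-type ontology, the same mechanism works in domains where reliable answer-type identification is unavailable, which is the adaptation scenario the paper targets.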